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Abstract 

To characterize the genetic variation of alternative splicing, we develop GLiMMPS, a robust statistical method for 
detecting splicing quantitative trait loci (sQTLs) from RNA-seq data. GLiMMPS takes into account the individual 
variation in sequencing coverage and the noise prevalent in RNA-seq data. Analyses of simulated and real RNA-seq 
datasets demonstrate that GLiMMPS outperforms competing statistical models. Quantitative RT-PCR tests of 26 
randomly selected GLiMMPS sQTLs yielded a validation rate of 100%. As population-scale RNA-seq studies become 
increasingly affordable and popular, GLiMMPS provides a useful tool for elucidating the genetic variation of 
alternative splicing in humans and model organisms. 
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Background 

Alternative splicing (AS) is the process by which exons 
from precursor mRNA transcripts are differentially 
included during splicing, resulting in different mature 
mRNA isoforms from a single gene locus [1]. AS is a 
major contributor to the control of gene expression and 
protein diversity. More than 90% of human genes are 
alternatively spliced [2]. Changes in the relative ratio of 
alternatively spliced isoforms of a single gene can have 
significant phenotypic consequences and cause various 
diseases [3,4]. 

The control of AS is mediated through extensive pro- 
tein-RNA interactions involving cis regulatory elements 
and trans acting factors [5]. Genetic polymorphisms that 
alter cis splicing regulatory elements can result in differ- 
ence of alternative splicing among human individuals 
and subsequently affect gene expression or protein activ- 
ity. Increasing evidence suggests that such natural varia- 
tion of alternative splicing can influence complex traits 
or modify disease risks [6]. For example, genetic variation 
of alternative splicing in the sodium channel gene 
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SCN1A can influence the response to antiepileptic drugs 
[7]. To date, most genome-wide surveys of alternative 
splicing variation in human populations were carried out 
on the HapMap lymphoblastoid B cell lines (LCLs), 
whose genomic variants have been extensively character- 
ized by the HapMap [8] and 1000 Genomes projects [9]. 
The first few studies utilized the Affymetrix exon array 
with approximately 6 million exon-targeted probes 
[10-12]. In these studies, the microarray probe intensities 
of individual exons were compared to those of whole 
genes to quantify exon inclusion levels and then associa- 
tions with single- nucleotide polymorphisms (SNPs) were 
tested to identify splicing Quantitative Trait Loci 
(sQTLs). Another study used the same exon array plat- 
form to characterize tissue-specific control of alternative 
splicing in brain and peripheral blood mononuclear cell 
samples [13]. These studies have shed light on the preva- 
lence and functional importance of alternative splicing 
variation in human populations. The development of the 
high-throughput RNA sequencing (RNA-seq) technology 
has provided a powerful alternative to splicing sensitive 
microarray for exon level expression quantification. 
RNA-seq has several advantages compared to microarray, 
including a greater dynamic range of exon expression 
levels, the ability to detect novel transcripts not probed 
on the array, the ability to better quantify exon inclusion 
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levels, single nucleotide level resolution, and less con- 
founding effects from polymorphisms on the target exons 
[14,15]. Several studies have used the RNA-seq technol- 
ogy to characterize transcriptome variation in HapMap 
LCLs at the whole-gene and/or individual exon level. 
Pickrell et al. and Montgomery et al. used low-coverage 
(4-25 million short reads per individual) single-end and 
paired-end RNA-seq to characterize gene expression and 
splicing in LCLs derived from 69 Nigerian [16] and 60 
CEU (Utah residents of European descent from CEPH- 
Centre d'Etude du Polymporphisme Humain) [17] indivi- 
duals. Cheung et al. independently generated an RNA- 
seq dataset on 41 CEU individuals at a deeper coverage 
of 28.4-66 million single-end reads per individual, 
although the authors restricted their data analysis to 
expression QTLs [18]. 

Despite the novel findings in these pioneering RNA-seq 
studies, the statistical models applied for sQTL detection 
were simple linear regression models (lm) and did not 
model all the relevant information contained in the com- 
plex RNA-seq data. Montgomery et al. used the exon read 
counts as the phenotype and carried out spearman corre- 
lation analysis with the genotypes [17], while Pickrell et al. 
used the percentage of the exon read counts over total 
gene read counts as the quantitative trait and carried out 
linear regression over genotypes [16]. Neither approach 
directly estimated the percent inclusion levels of target 
exons. Moreover, by treating the exon expression mea- 
surement as a point estimate, neither approach considered 
the variability of RNA-seq read count that strongly affects 
the uncertainties in estimates of exon splicing activities 
[14]. Here we report a novel method GLiMMPS (General- 
ized Linear Mixed Model Prediction of sQTL) for robust 
detection of sQTLs from RNA-seq data. The GLiMMPS 
model takes into account the individual variation of exon- 
specific read coverage as well as the prevalent overdisper- 
sion of simple statistical models when applied to RNA-seq 
data [19,20]. Importantly, GLiMMPS uses the reads infor- 
mation from both exon inclusion and skipping isoforms to 
model the estimation uncertainty of exon inclusion level, 
instead of treating the exon inclusion level as a point esti- 
mate in sQTL analysis (see Materials and methods and 
Figure 1 for details). Using both simulated and real RNA- 
seq datasets, we demonstrate that GLiMMPS outperforms 
competing statistical models (linear model and generalized 
linear model), and identifies sQTLs at a low false positive 
rate as indicated by extensive RT-PCR tests. 

Results 

AS in the human population measured from RNA-seq 
data 

We obtained the RNA-seq data from two published 
studies on the CEU population (of European ancestry) 
by Cheung et al. [18] and Montgomery et al. [17]. 



Cheung et al. generated 28.4-66 million 50 bp single- 
end reads per individual on 41 CEU samples, while 
Montgomery et al. generated 3.5-17.1 million 37 bp 
paired-end reads per individual on 60 CEU samples. 
Twenty-nine individuals were shared between the two 
datasets. Because of the higher sequencing depth in the 
Cheung et al. dataset, the analysis in this manuscript 
was primarily conducted on the Cheung et al. data 
(referred to hereafter as the CEU dataset). We also used 
the low-coverage Montgomery et al. data (referred to 
hereafter as the CEU2 dataset) to evaluate the concor- 
dance of results between the two CEU sample datasets. 

The RNA-seq reads were mapped to the human gen- 
ome (hgl9) and transcriptome (Ensembl gene annota- 
tion r65) using the software Tophat [21]. To estimate 
the exon inclusion level (denoted as y/ for PSI, that is 
Percent Spliced In) from RNA-seq data, we used 
sequence reads mapped to splice junctions compiled 
from both the splice junctions in Ensembl gene annota- 
tions as well as the novel junctions found by Tophat. 
Based on the AS patterns, we classified the AS events 
into four categories (Figure SI in Additional file 1): 
skipped exon (SE), alternative 5' splice site (A5SS), alter- 
native 3' splice site (A3SS), and mutually exclusive 
exons (MXE). Using all splice junction reads, we can 
obtain a point estimate of the exon inclusion level 
We illustrate the estimate of yi in our model using the 
SE event as the example (Figure la). Suppose // and SJ 
represent read counts of inclusion and skipping splice 
junctions, respectively, because // can come from both 
the upstream junction and the downstream junction, we 
treat the effective read count from the exon inclusion 
isoform y = IJ/2 and the effective read count from the 
exon skipping isoform as SJ. Given an observed total 
junction read count of n = IJ/2+SJ , the point estimate 
of i/r = y/n- The median and coefficient of variation (CV) 
of {fr of skipped exons from CEU and CEU2 (with 
|Ai/f| > 0.1 within each of the two populations, see 
Materials and methods) are highly correlated with a 
Pearson correlation coefficient of 0.99 and 0.90, respec- 
tively, suggesting that the point estimate of i/r provides a 
reasonable approximation to the exon inclusion level. 
However, we also noted that the total counts of splice 
junction reads for the same alternatively spliced exon 
typically vary substantially across different individuals 
(Figure S2 in Additional file 1), possibly due to the 
intrinsic randomness of RNA-seq technology as well as 
individual variation in gene expression levels. Such 
variability of read depth is expected to differentially 
affect the reliability of {jr estimates across individuals. 
This motivated us to develop an improved statistical 
model that explicitly considers the variation of RNA-seq 
read depth across individuals. 
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Figure 1 Schematic outline of GLiMMPS. (a) RNA-seq reads mapped to splice junctions of alternatively spliced exons are used for estimating 
exon inclusion levels i//. Shown here is a schematic illustration using the skipped exon (SE) type of alternative splicing events as the example. 
White, sQTL target exon; black and gray, flanking exons. The inclusion junction (IJ) reads consist of reads mapped to the upstream and 
downstream splice junctions of the exon inclusion isoform, while the skipping junction (SJ) reads are reads mapped to the skipping splice 
junction of the exon skipping isoform. (b) Illustration of the GLiMMPS statistical model. SNP genotype effect is modeled as fixed effect fy. The 
overdispersion is modeled as individual level random effect Wy. 



Statistical model and simulation study of GLiMMPS 

We first attempted to handle the individual variation of 
RNA-seq read depth by extending the previously used 
linear model (lm) [16] to a generalized linear model 
(glm) with a logit link function, which assumes the read 
count from the exon inclusion isoform (y) follows a 
binomial distribution y\x[r ~ Binomial{n, i/r), and logit(yy) 
is linearly modeled by the SNP effect. This simple logis- 
tic regression model assumes that y/ is correctly mod- 
eled and thus: £(y,) = n,-^,-, and Vartyi) = n^ifl - tyi)- 
However, we found that overdispersion (inflation of var- 
iance) is widespread in the experimental data (Supple- 
mentary Methods in Additional file 1). For the top 
sQTLs (Type I error <1% based on permutation) identi- 
fied from glm in the CEU dataset, >90% sQTLs have sig- 
nificant overdispersion (Figure S3 in Additional file 1). 

To model the overdispersion, we developed GLiMMPS, 
a generalized linear mixed model for detecting sQTLs. 
To deal with the overdispersion in the generalized linear 
model, we model the extra variance of y/ as a random 
effect for each individual i in the regression model with 
random effects, My ~ N(0, ofy [22]. Let u y = a u) Z i), where 
Zij ~ N(0, 1), pj denoting the fixed effect for SNP /', the 
second level of the model can be written as: 
x[fi = logit _1 (/8 0 + fijgij + cr U jZij). GLiMMPS is essentially a 
hierarchical model that considers both the read depth 
variation and the exon inclusion level variation within 
the same genotype groups (Figure lb). Details of the lm, 



glm, and GLiMMPS models were described in Materials 
and methods and Supplementary Methods in Additional 
file 1. 

We first conducted simulation studies to compare the 
power and robustness of GLiMMPS to lm and glm. We 
simulated splice junction read counts with various levels 
of read depth, difference of exon inclusion levels among 
genotype groups, and overdispersion mimicking the 
parameter distributions in the CEU dataset (Figure S4 in 
Additional file 1). We simulated 10,000 data points for 
each read depth with mean total splice junction reads 
ranging from 5 to 80. Data were simulated with 20% 
data points having genotype effects as distributed from 
the CEU dataset and the remaining 80% having no dif- 
ference in exon inclusion levels among genotypes (see 
details in Supplementary Methods in Additional file 1). 
Note that the simulation data generated through this 
procedure are not inherently biased towards any of the 
statistical models tested. Using the 80% simulated data 
points with no SNP effect under various read depth, we 
evaluated the false positive rates (type I errors) at 5% 
significance level. The false positive rates of GLiMMPS 
and lm are always close to the nominal significance 
level, while glm has a highly inflated false positive rate, 
especially for data with large total splice junction reads 
(Figure 2a). This confirms that it is essential to incorpo- 
rate overdispersion in the hierarchical model to avoid 
the inflation of P values. We also computed the receiver 
operating characteristic (ROC) curves by combining all 
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Figure 2 Performance evaluation of different statistical models using simulated data (a) The observed false positive rate at the 
significance level of 0.05 for the linear model (Im), generalized linear model (glm), and GLiMMPS. Data were simulated with different sequencing 
depth with mean total junction reads ranging from 5 to 80, as described in Supplementary Methods in Additional file 1. (b) Receiver operating 
characteristic (ROC) curve analysis demonstrates that GLiMMPS outperforms the Im and glm models. The ROC curve plots the fraction of true 
positives called correctly and the fraction of false positives called incorrectly using P-values from each model. The zoomed-in figure shows the 
part of the ROC curve where the false positive rate is in the range of (0, 0.2). 



the simulated data with or without SNP effects. The 
ROC curves show that GLiMMPS outperforms the lm 
and glm models (Figure 2b), especially in the most criti- 
cal part of the ROC curve where the false positive rate 
is low. The true positive rate of GLiMMPS is approxi- 
mately 5% to 20% higher than those of lm and glm 
when the false positive rate ranges from 0.01 to 0.1 
(Figure 2b, inset). Furthermore, to model the non- 
uniformity and bias in sequence-specific sequencing pre- 
ferences in RNA-seq data, we performed an additional 
simulation analysis. Specifically, for each exon inclusion 
or skipping splice junction we rescaled the original 
simulated count by a random scaling factor ranging 
from 0.5 to 2, with 10% variation in the scaling factor for 
the same splice junction across different individuals. We 
observed no change in the performance of GLiMMPS as 
compared to lm and glm (data not shown). 

Performance of GLiMMPS in real human RNA-seq data 

To further assess the performance of GLiMMPS, we ana- 
lyzed the two human RNA-seq datasets on CEU LCL sam- 
ples (CEU and CEU2) using the GLiMMPS, lm, and glm 
models. As previous studies suggested that the signal 
SNPs for most sQTLs are near the target exons [11,16], 
we carried out sQTL analysis for all common SNPs 
(minor allele frequency >0.05) within 200 kb from alterna- 
tively spliced exons with a median of at least 5 total splice 



junction reads in both CEU and CEU2 samples. We used 
permutation to determine the null distribution of minimal 
P values of SNPs near exons. Subsequently we applied the 
false discovery rate (FDR) correction to establish a cutoff 
P value corresponding to the FDR level of 0.1 (see details 
in Supplementary Methods and Figure S5 in Additional 
file 1). This yielded 140 unique AS events in 106 genes 
with significant sQTL signals in the CEU dataset (Addi- 
tional file 2). Because of the lower sequencing depth, there 
were a smaller number (56) of significant sQTLs identified 
by GLiMMPS in the CEU2 dataset. Nonetheless, the sig- 
nificant sQTL signals identified by GLiMMPS are strongly 
correlated between the two datasets (Figure 3a). Among 
the 56 significant sQTLs (FDR <0.1) in CEU2, 39 (70%) 
are also significant in CEU. Although there is a larger 
proportion of significant sQTLs in CEU showing no signif- 
icance in CEU2, it is most likely due to the lower sequen- 
cing depth in CEU2. To quantitatively compare the 
relative rankings of sQTLs identified by different models 
(GLiMMPS, lm, and glm) in CEU and CEU2, we calcu- 
lated the proportion of sQTL exons among the top n most 
significant in CEU that were also among the top n in 
CEU2 (n ranges between 20 and 160). Compared to lm 
and glm, GLiMMPS produces a much higher concordance 
of rankings between the two datasets, especially for the 
top 60 sQTLs which correspond to approximately 10% 
FDR in the CEU2 dataset (Figure 3b). 
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CEU -log 10 (P) Number of top sQTLs 

Figure 3 Concordance of sQTLs in two RNA-seq datasets of the Caucasian (CEU) population as obtained by different statistical 
models, (a) Comparison of GLiMMPS P values for the most significant SNP of each alternatively spliced exon in the CEU and CEU2 datasets. 
X-axis shows the -log q0 (P value) in CEU. Y-axis shows the -log 10 {P value) in CEU2. Red lines show the FDR cutoff of 10%. (b) Concordance of 
sQTL rankings between CEU and CEU2 based on different statistical models. The x-axis represents the number of top n ranked sQTLs in each 
dataset, while the y-axis represents the percentage of sQTLs in common between the two datasets among the top n sQTLs in CEU, based on 
P value rankings calculated by the linear model (Im), generalized linear model (glm), and GLiMMPS. 



To experimentally assess the robustness of GLiMMPS 
predictions, we randomly selected 24 SE (skipped exon) 
type and 2 A5SS (alternative 5' splice site) type of 
sQTLs out of the 140 significant sQTLs detected in the 
CEU dataset and performed RT-PCR validation using 
quantitative fluorescent RT-PCR (Materials and meth- 
ods). For the validation experiments, we used an inde- 
pendent panel of 86 HapMap LCLs covering diverse 
worldwide populations (Additional file 3). All 26 sQTLs 
were validated, yielding a validation rate of 100% (Addi- 
tional file 4; Figure S6 in Additional file 1). In eight indi- 
viduals analyzed by both RNA-seq and RT-PCR, the 
exon inclusion levels estimated by RNA-seq were highly 
correlated with RT-PCR measurements (Pearson corre- 
lation coefficient r = 0.87). It is noteworthy to mention 
that these 26 selected sQTLs have a wide range of 
P value rankings among the 140 significant sQTLs, as 
opposed to being selected from the top of the significant 
sQTL list. The interquartile range of their rankings is 38 
to 95. This suggests that the vast majority of the sQTLs 
identified by GLiMMPS represent true signals of spli- 
cing variation in human populations. 

GLiMMPS reveals positional features of sQTLs 

Next we examined the positional distribution of SNPs 
associated with significant sQTL signals in the CEU 



dataset. It should be noted that the genotype informa- 
tion for the CEU dataset came from both HapMap and 
1000 Genomes project data, thus they capture the 
vast majority of common SNPs in the human genome. 
Consistent with previous sQTL studies using arrays and 
lower-density HapMap SNPs [11,12], sQTL signal SNPs 
with a GLiMMPS P value <3.70E-06 (corresponding to 
FDR <0.1) are centered around the splice sites (SS) of 
target exons. A local examination of the SNP positions 
for the 140 significant GLiMMPS sQTLs indicates that 
the precise locations of these SNPs are strongly corre- 
lated with their potential impacts on splicing. As we 
increased the stringency of the P value cutoff for signifi- 
cant sQTLs, we observed a steady increase of the pro- 
portion of sQTLs with at least one significant signal SNP 
within 300 bp of the splice sites (Figure 4a). The turning 
point is around FDR = 0.1, where only around 20% of 
sQTLs have no significant signal SNPs discovered within 
300 bp of the splice sites. To further evaluate the correla- 
tion between SNP positions and potential impacts on 
splicing, we classified all cis SNPs within 200 kb of the 
sQTL exons into five categories according to the SNP 
location relative to the splice site, where 5' SS represent 
the nine bases of the 5' splice site including six bases in 
intron and three bases in exon, and 3' SS represent the 
23 bases of the 3' splice site including 20 bases in intron 
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Figure 4 Positional distribution of sQTL SNPs. (a) The fraction of sQTL exons with significant SNPs within 300 bp of the splice sites as a 
function of the P value cutoffs for significant sQTLs. X-axis is the -log 10 (P value) for cutoffs to define significant sQTLs. Y-axis is the fraction of 
sQTL exons with any SNP called significant within 300 bp of the splice sites, (b) The boxplot of GLiMMPS P values for all SNPs around the 140 
significant sQTLs (FDR <0.1), grouped into five categories based on the positions of SNPs with respect to the splice sites. 



and three bases in exon [23]. We observed a striking dif- 
ference in the distribution of sQTL P values for cis SNPs 
located in different regions (Figure 4b). Specifically, cis 
SNPs located within the 5' SS have the smallest overall 
P values, followed by SNPs within the 3' SS and exons, 
and intronic SNPs within 300 bp of the splice sites. SNPs 
located in the distal intronic regions (>300 bp from the 
splice sites) have the biggest overall sQTL P values, sug- 
gesting that they are least likely to affect splicing. This 
trend is consistent with the observation by Pickrell et al. 
showing the enrichment of sQTL signal SNPs in splice 
sites [16], but with a finer classification of SNP locations. 

To definitively identify causal SNPs underlying signifi- 
cant sQTLs, we tested the effects of individual SNPs on 
splicing using minigene reporter assays. It should be noted 
that since multiple SNPs can be in high linkage disequili- 
brium (LD) with each other, an sQTL signal SNP with 
high association to exon splicing may not necessarily be 
the causal SNP that affects splicing regulation. In fact, the 
140 significant sQTLs (FDR <0.1) have on average 63 sig- 
nificant SNPs. Of the 26 RT-PCR validated sQTLs, the 
causal SNPs in four genes (CAST, DHRS1, HMSD, and 
A TP5SL) were confirmed previously in work by us [24] 
and others [11]. The causal SNPs in CAST, DHRS1, and 
HMSD are located in the 5' SS [24], while the causal SNP 
in A TP5SL is located in the exon and disrupts two putative 
exonic splicing enhancers [11]. For the remaining RT-PCR 
confirmed sQTLs, we randomly selected 14 for minigene 



experiments. Briefly, the target exon and 350-500 bp of 
surrounding intronic sequences on each side of the exon 
were sub-cloned into the minigene expression vector and 
site-directed mutagenesis was carried out to generate the 
alternative alleles. After transiently transfecting these plas- 
mids into HEK293 cells, we performed quantitative RT- 
PCR analysis of wild-type and mutant minigene reporters 
to determine the effect of the SNPs on exon inclusion 
levels (see details in Materials and methods). In 10 of the 
14 exons analyzed (NTPCR, KIAA1841, SP140, ITM2C, 
PARP1S, PTK2B, BCL2A1, SHMT1, ITPA, and ARFGAP3), 
the minigene experiments identified at least one SNP that 
caused >10% change of the minigene exon inclusion levels, 
with the direction of change matching the RNA-seq/RT- 
PCR data (Figure S7 in Additional file 1). These include 
two exons where we found multiple SNPs with additive 
effects on splicing within one LD block {KIAA1841) or 
multiple LD blocks (ITPA). In another two exons (PPIL3 
and NCAPG2), the minigene experiments failed to identify 
any SNP with strong effect on splicing. For PPIL3, the 
SNP rslll292412 in the 3' SS affected splicing of the mini- 
gene reporter in the same direction as in the RNA-seq 
data, but the change was minor (from 9% in AA to 5% in 
GG). In NCAPG2, the closest sQTL SNP was an intronic 
SNP 347 bp away from the splice sites, and it did not have 
any measurable impact on the splicing of the minigene 
reporter. It is possible that another proximal or distal SNP 
or indel not genotyped yet is responsible for this sQTL 
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signal. Finally, in the last two exons analyzed (CLEC2D 
and MX1), the minigene exon inclusion levels were close 
to 100% for all alleles, suggesting that the cloned minigene 
reporters transfected to the HEK293 cells did not faithfully 
recapitulate endogenous exon splicing activities in the 
LCLs. Taken together, despite the inherent limitations of 
minigene reporter systems [25], we were able to use mini- 
gene experiments to identify the causal SNPs underlying 
10 of the 14 sQTL signals analyzed. In all 10 exons, the 
causal SNPs confirmed by minigene analysis were located 
proximal to the alternatively spliced exons (that is, within 
300 bp of the splice sites). 

sQTLs explain GWAS signals of human traits and diseases 

A powerful application for characterizing human tran- 
scriptome variation such as eQTLs and sQTLs is to inter- 
pret signals from GWAS studies [26-28]. Although 
GWAS have had great success in identifying numerous 
disease susceptibility loci, the peak signal SNPs identified 
by GWAS provided little information about the underly- 
ing causal variants or the molecular mechanisms respon- 
sible for the observed association [29]. Compelling 
evidence indicates that a large fraction of the underlying 
causal variants affect phenotypes via non-coding (for 
example, influencing gene regulatory processes such as 
transcription and RNA processing) as opposed to coding 
(direct amino acid changes) mechanisms [30,31]. The 
important role of alternative splicing in shaping the 
human transcriptome diversity suggests sQTL SNPs may 
represent the causal variants underlying many observed 
GWAS signals. Indeed, previous studies of alternative 
splicing variation using RT-PCR, array, and sequencing 
based technologies have identified candidate sQTLs 
linked to GWAS signals [11-13,32-36]. We investigated 
all significant sQTL SNPs (GLiMMPS FDR <0.1) in high 
(r 2 >0.8) linkage disequilibrium (LD) with GWAS signal 
SNPs listed in the Catalog of Published Genome-Wide 
Association Studies [37] (see Materials and methods). 
We identified 10 sQTLs strongly linked to GWAS SNPs 
of human traits or diseases (Table 1). The list include 
known splicing altering SNPs for CAST, ERAP2, and 
A TP5SL, as well as novel findings with intriguing biologi- 
cal and medical implications. 

In a previous GWAS study, the SNP rsl3160562 near 
CAST was discovered to be significantly associated with 
alcohol dependence [38]. However, no functional impli- 
cation of this SNP was discussed in the original study. 
Here, GLiMMPS identified this SNP as an sQTL signal 
SNP in CAST. It is significantly associated with the spli- 
cing of CAST exon 13 located 45 kb upstream of the 
SNP position. It is also in an LD block (r 2 = 0.53) with 
another SNP rs7724759 located in the 5' SS of exon 13, 
which has been confirmed experimentally to alter the 
splicing of this exon [11,24]. Thus, genetic variation of 



alternative splicing is the likely causal mechanism 
underlying the reported association of CAST and alcohol 
dependence. In ERAP2, GLiMMPS identified SNP 
rs2248374 as an sQTL signal SNP for exon 10. This 
SNP disrupts the activity of the 5' SS [11]. This sQTL 
SNP is in high LD (r 2 = 0.83) with a GWAS SNP 
rs2549794, previously identified as significantly asso- 
ciated with Crohn's disease [39]. The skipping of this 
alternatively spliced exon from ERAP2 introduces a pre- 
mature stop codon, resulting in nonsense-mediated 
decay of the exon skipping isoform and a dramatic 
reduction of overall transcript levels, which subsequently 
impacts antigen representation [40]. Haplotype analysis 
of the sQTL SNP and its linked SNPs revealed evidence 
of strong balancing selection during human evolution 
[40], suggesting the functional and evolutionary impor- 
tance of this sQTL. A third example is ATPSSL, identi- 
fied as a GWAS locus associated with height in multiple 
populations [41]. The peak signal SNP reported by 
GWAS is rsl7318596 but the mechanism of this SNP 
was unclear in the original study. GLiMMPS identified a 
significant sQTL for exon 5 of ATPSSL. The sQTL SNP 
rsl043413 is strongly linked to the GWAS signal SNP 
rsl7318596 (r 2 = 0.84) (Figure S8 in Additional file 1). 
This sQTL SNP rsl043413 is located in exon 5 and dis- 
rupts two exonic splicing enhancers (ESEs) [11]. Together, 
these data indicate that even at a very modest sequencing 
depth (28.4-66 million 50 bp single-end reads per indivi- 
dual), GLiMMPS recovered previously reported associa- 
tions between SNP and splicing that may contribute to 
phenotypic variation in humans. 

We also identified novel sQTL signals with interesting 
functional and disease implications. For example, we 
identified a novel sQTL signal in SP140 associated with 
previously identified GWAS signals for chronic lympho- 
cytic leukemia [42], multiple sclerosis [43], and Crohn's 
disease [39]. SP140 is a tissue-specific gene whose expres- 
sion is restricted to lymphoid cells [44]. Its protein 
domain structure suggests a role in chromatin-mediated 
regulation of gene expression [45]. A previous GWAS 
analysis of chronic lymphocytic leukemia identified a risk 
SNP rsl3397985 located in intron 1 of SP1 40. It was pro- 
posed that this GWAS signal SNP affects SP140 gene 
transcription [42] , but a recent replication study indicates 
that the association of this SNP to SP140 steady state 
gene expression levels is only marginal (FDR = 0.157 
after adjusting for multiple testing) [46]. It should be 
noted that the difference in gene expression levels among 
genotype groups is minor and marginal according to the 
CEU RNA-seq data as well {P value = 0.07). On the other 
hand, GLiMMPS found a novel significant sQTL signal 
for exon 7 of SP140 (Figure 5a). The peak sQTL signal 
SNP rs28445040 (GLiMMPS P value = 1.69E-14) located 
in exon 7 is in high LD with the GWAS signal SNPs for 
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Table 1 The list of sQTL signals linked to GWAS signals. 



Gene 



AS 

type 3 



Target exon (hg19) 



sQTL 
SNP C 



SNP type GWAS trait (SNP) 



GWAS 
references 



ACADM SE 
DRAM2 SE 
SP140 SE 



CAST SE 

ERAP2 A5SS d 

MRPL11 A5SS d 

ARL6IP4 A3SS e 



ULK3 



MXE 



ATP5SL SE 
ITPA SE 



+chr1 :761 94085-761 941 73 rs7524467 < = 300 

bp 

-chrl :1 1 1 6821 22- rs3762374 5' SS 

111682288 

+chr2:231 1 10577- rs28445040 Exon 

231110655 



+chr5:96076448-96076487 rs7724759 5' SS 
+chr5:96235824-96235949 rs2248374 5' SS 



-chrl 1:66206102-66206319 rs1 



10 



Exon 



+chr1 2:1 234661 17- rs55742290 3' SS 
1 23466426 

-chr15:75130091-75130139 rs12898397 5' SS 

-chrl 9:41 9391 76-41 939339 rs10434l3 Exon 

+chr20:3193814-3193872 rsl 127354 Exon 



Metabolic traits (rs21 171 8) [78] 

Liver enzyme levels (gamma-glutamyl transferase) [79] 
(rsl 335645) 

Chronic lymphocytic leukemia (rsl 3397985) [42] 

Multiple sclerosis (rsl 0201872) [43] 

Crohn's disease (rs742361 5) [39] 

Alcohol dependence (rsl 31 60562) [38] 

Crohn's disease (rs2549794) [39] 

Ankylosing spondylitis (rs301 87) [80] 

Bipolar disorder (rs2242663) [81] 

Platelet counts (rs729641 8, rsl 727307) [82] 

Coffee consumption (rs6495122) [83] 

Coronary heart disease (rs2472299) [84] 

Height (rsl 731 8596) [41] 

Response to hepatitis C treatment (rsl 1 6971 86, [50] 
rs61 39030) 

Ribavirin-induced anemia (rsl 1 27354) [85] 



a AS type: SE, skipped exon; A5SS, alternative 5' splice site; A3SS, alternative 3' splice site; MXE, mutually exclusive exons. 

b Exon coordinates are in hg19 with the start position 0 based and the end position 1 based. The direction (+/-) of transcription is denoted before the 
coordinates. 

The significant sQTL SNP (FDR<0.1) closest to the target exon. SNP position and P value from GLiMMPS can be found in Additional file 2. 
Alternative 5' SS: ERAP2, chr5:96235893; MRPL11, chrl 1:66206180. 
Alternative 3' SS: ARL6IP4, chrl 2:1 234661 41. 

'Mutually exclusive alternative exon: ULK3, chrl 5:751 30492-751 30533. 



chronic lymphocytic leukemia (rsl3397985, r 2 = 1), mul- 
tiple sclerosis (rsl0201872, r 2 = 0.92), and Crohn's dis- 
ease (rs7423615, r 2 = 1). The C to T mutation in 
rs28445040 does not lead to any amino acid change. 
However, according to RNA-seq data, the average exon 
inclusion levels for the CC, CT, and TT genotypes were 
96%, 77%, and 44%, respectively (Figure 5b). This trend 
was robusdy validated by RT-PCR experiments (Figure 5c). 
Furthermore, minigene assays confirmed the causal role 
of rs28445040 in regulating the splicing of SP140 exon 7 
(Figure S7 in Additional file 1). Collectively, these data 
strongly suggest that the SNP that alters the splicing of 
SP140 exon 7 is the causal genetic variant responsible for 
the reported associations with these diseases. The skipping 
of exon 7 causes an in-frame deletion of a 26 amino acid 
peptide segment from the SP140 protein product. Interest- 
ingly, this peptide segment is located within an intrinsically 
disordered region as predicted by IUPred [47]. Intrinsically 
disordered regions are enriched for sites of post-transla- 
tional modifications and protein-protein interactions, and 
two recent studies [48,49] show that alternative splicing of 
exons encoding disordered protein sequences frequently 
rewires protein-protein interaction networks in the pro- 
teome. In the future, it will be interesting to determine 
how alternative splicing of SP140 exon 7 regulates SP140 



protein functions and influences downstream cellular 
phenotypes. 

The identification of sQTLs can also help resolve appar- 
ent confusions about the causal mechanisms of GWAS 
signals. For example, the SNP rsll697186 located in gene 
DDRGK1 near the ITPA gene (Inosine Triphosphate Pyro- 
phosphohydrolase) was significantly associated with 
response to hepatitis C treatment in a GWAS study, and 
later was found to be in high LD with SNP rsl 127354 on 
ITPA exon 2 by fine mapping [50]. Of note, this C-to-A 
SNP (rsl 127354) on exon 2 has been well established in 
the pharmacogenetics field to be associated with ITPA 
enzyme deficiency or low-activity [51,52], but the molecu- 
lar mechanism was unclear. This non-synonymous SNP 
causes a proline to threonine change (P32T) in the IPTA 
protein product. However, based on the crystal structure 
of the human ITPA protein, the proline residue was far 
away from the active site of the enzyme [53]. Moreover, 
recent biochemical studies of ITPA showed that the puri- 
fied mutant protein with the P32T change has the same 
activity as the wild-type protein [54]. Others have pro- 
posed the alternative mechanism that this exon 2 SNP 
causes mis-splicing of ITPA [55], but the properties of 
the gene product resulting from mis-splicing have not 
been examined. Our analysis of the CEU RNA-seq data 
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Figure 5 An example of sQTL signal overlapping with GWAS signal near gene SP140. (a) The distribution of GLiMMPS P values around the 
sQTL exon (exon 7) in gene SP140. The black horizontal dashed line reflects the 10% FDR cutoff and red vertical lines mark the location of the 
sQTL exon. SNPs in linkage disequilibrium (r 2 >0.8 in the CEU population) with the GWAS SNPs (blue asterisks) are shown in solid black dots, 
while other SNPs are shown in grey circles. The causal splicing SNP in exon 7 is shown in red triangle. Exon-intron structure is shown in the 
bottom with GWAS SNPs and the causal splicing SNP (rs28445040) marked at corresponding locations, (b) Boxplot showing the significant 
association of rs28445040 with exon inclusion level (i//) of the SP140 exon 7 estimated by the CEU RNA-seq dataset. The size of each dot is 
scaled by the total number of splice junction reads for that individual, (c) The same boxplot using exon inclusion level ((//) measured by 
quantitative RT-PCR. 



identified the same ITPA exon 2 SNP as a significant 
sQTL signal SNP (P value = 5.80E-09) associated with the 
combined skipping of exons 2 and 3. This prediction is 
robustly validated by RT-PCR (Figure S9 in Additional 
file 1). Minigene experiments further confirmed that 
this exonic SNP as well as an adjacent intronic SNP 
(rs7270101) both reduced the inclusion levels of exons 2 
and 3 (Figure S7 in Additional file 1). These results rein- 
force the proposed effect of this ITPA SNP at the RNA 
level [55], and suggest that future studies on the causal 
mechanism of this ITPA gene variant should compare the 
activities of the full-length protein isoform to the trun- 
cated isoform that lacks exons 2 and 3. 

Discussion 

We have developed GLiMMPS, a generalized linear 
mixed model to detect genotype-splicing associations 
from RNA-seq data. The key advantage of GLiMMPS 
over previously used methods is that it models: (1) var- 
iation in exon-specific read coverage across individuals; 
and (2) overdispersion in RNA-seq read counts. Both 
issues are important for accurate exon-level expression 
quantitation. The coverage of RNA-seq reads for any 
given alternative exon is a critical factor for the preci- 
sion of the exon inclusion level estimate [14,56]. 



The importance of accounting for overdispersion in 
RNA-seq data analysis has also been well recognized 
[57]. Methods based on the negative binomial model 
[58,59] or the generalized linear model with Cox-Reid 
dispersion estimators [19,20] have been developed for 
modeling dispersion in detecting differential gene or 
exon expression between biological states. Here in the 
sQTLs analysis, by modeling these two levels of varia- 
tion in RNA-seq read counts, GLiMMPS achieves super- 
ior performance over competing statistical models, as 
demonstrated by analyses of simulated and real RNA- 
seq data. Importantly, even at a low coverage we 
observed a high level of concordance in the GLiMMPS 
results between the two human datasets (CEU and 
CEU2). Additionally, RT-PCR tests of 26 randomly 
selected significant sQTLs yielded a validation rate of 
100%. Together, these results demonstrate that 
GLiMMPS is a robust and improved method to detect 
sQTLs from RNA-seq data. 

Fine-scale analysis of sQTLs reveals positional features 
of SNPs that alter exon splicing. We found that the 
location of the SNPs is strongly correlated with potential 
impact on splicing (Figure 4b). Specifically, SNPs located 
within the 5' and 3' splice sites have the smallest (most 
significant) overall GLiMMPS P values, consistent with 
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the importance of the splice sites in exon recognition 
during pre-mRNA splicing. Interestingly, the significance 
level of sQTLs is positively correlated with the proximity 
of the sQTL signal SNPs to target exons. As we 
increased the significance level cutoff for sQTLs, we 
observed a progressive increase of the proportion of 
sQTLs with at least one significant signal SNP within 
300 bp of the splice sites (Figure 4a). The causal roles of 
these proximal sQTL SNPs on exon splicing were 
further confirmed by minigene splicing reporter assays. 
Collectively, these results support the hypothesis that 
the majority of cis regulatory information controlling 
alternative splicing is encoded in close proximity (for 
example, within 300 bp) of the target exons, consistent 
with a recent analysis of the mammalian splicing code 
[60] . Nonetheless, it should also be noted that 20% of the 
significant sQTLs (FDR <0.1) lack any significant signal 
SNP within 300 bp of the splice sites, including sQTLs 
confirmed experimentally by RT-PCR (in NCAPG2 and 
PIGQ, see Figure S6 in Additional file 1). For such 
sQTLs, it is possible that the causal SNPs are indeed 
proximal, but are missing from current SNP annotations 
or fail to reach the significance level cutoff due to small 
sample size. Alternatively, we cannot rule out the possibi- 
lity that a small fraction of sQTLs are indeed due to 
SNPs disrupting distal splicing regulatory elements, given 
that the physical binding sites of splicing factors on the 
pre-mRNA can be located deep into the introns [61]. In 
the future, it would be interesting to confirm the identity 
and elucidate the regulatory mechanisms of causal sQTL 
SNPs acting in introns distal to target exons. 

The detection of sQTLs is useful for interpreting sig- 
nals from GWAS studies. Despite the success of GWAS 
in revealing the genetic basis of complex traits and 
diseases, elucidating the mechanistic implications of 
GWAS findings remains a major challenge [29]. As 
many functional SNPs may affect gene expression and 
regulation instead of the final protein sequence, inte- 
grating transcriptome information with GWAS signals 
has proven to be an effective approach for pinpointing 
the functional causal variants underlying GWAS signals 
[62-64]. Here, from the CEU RNA-seq dataset we iden- 
tified 140 unique sQTLs, including 10 significantly 
linked to previously identified GWAS signals (Table 1). 
This is probably only scratching the surface of trait- 
associated sQTLs, due to the low sequencing depth 
(28.4-66 million single-end reads per individual) and the 
small sample size (41 individuals). We anticipate that 
with more and deeper RNA-seq data generated for 
diverse human tissues and cell types, the catalog of 
sQTLs linked to phenotypic traits and diseases will 
rapidly expand in the near future. 

The GLiMMPS framework provides the basis for sev- 
eral aspects of future extensions. Currently, GLiMMPS 



uses reads mapped to splice junctions to estimate exon 
inclusion levels. This is a commonly used approach in 
alternative splicing quantitation from RNA-seq data 
[56,65,66]. However, with proper normalization for 
lengths of isoform-specific segments, it is feasible to also 
incorporate reads mapped within the exons, which may 
further improve the power in detecting sQTLs. This 
could be particularly useful for strand-specific RNA-seq, 
where the origins of exon body reads can be unambigu- 
ously assigned to sense or antisense transcripts. Addi- 
tionally, in paired-end RNA-seq data with tight 
distribution of insert size, reads that map to flanking 
constitutive exons can also provide useful information 
about the exon inclusion level [14]. Furthermore, RNA- 
seq reads often display non-uniform distribution along 
mRNA transcripts due to sequence- specific bias in RNA 
sequencing, and several methods have been developed 
to model and correct for such biases [67-70]. In princi- 
ple, we can use a suitable bias correction method to 
adjust the raw RNA-seq read counts, prior to analysis 
by GLiMMPS. However, we tested two well-known bias 
correction methods [67,68] using a deep RNA-seq data- 
set with matching quantitative RT-PCR data for over 
100 exons in two cell lines [66,71], but did not observe 
improvement in the RNA-seq estimates of exon inclu- 
sion level as judged by the correlation of RNA-seq esti- 
mates with the RT-PCR measurements. Another area of 
improvement is to consider the potential impact of spe- 
cific SNPs on exon splicing as the prior in the statistical 
model, an idea previously used for detecting expression 
QTLs [72-74]. For example, our results show a signifi- 
cant association between the SNP position and the 
potential impact on splicing (Figure 4), with SNPs 
located in the 5' and 3' splice sites most likely to influ- 
ence exon splicing. It is possible to incorporate such 
positional information or more advanced predictive 
models of exon splicing [60] as the prior information to 
guide the detection of sQTLs. 

Conclusions 

RNA-seq has become a powerful and increasingly 
affordable technology for population-scale analysis of 
transcriptome variation. Here we report GLiMMPS, a 
robust statistical method for detecting splicing quantita- 
tive trait loci (sQTLs) from RNA-seq data. GLiMMPS is 
applicable to all major patterns of alternative splicing 
events. The GLiMMPS source code and user manual are 
freely available for download at [75]. As the cost of 
high-throughput sequencing continues to decline, we 
anticipate that combined sequencing of genomes and 
transcriptomes will become a popular design in large- 
scale studies of traits and diseases. GLiMMPS provides a 
useful tool for genome-wide identification of sQTLs 
from population-scale RNA-seq datasets. 
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Materials and methods 

RNA-seq datasets 

We downloaded the RNA-seq datasets produced by [18] 
and [17]. Both datasets came from the lymphoblastoid B 
cell lines from the Caucasian (CEU) population in the 
HapMap project [8]. There were 28.4-66 million 50 bp 
single end reads sequenced for 41 individuals by Cheung 
et al., while there were only 3.5-17.1 million 37 bp 
paired end reads for 60 individuals by Montgomery et 
al. We denote the datasets from Cheung et al. and Mon- 
tgomery et al. as CEU and CEU2, respectively. Because 
the sequencing depth in CEU is much higher than in 
CEU2, we focused our analysis on the CEU dataset, but 
also carried out comparison between CEU and CEU2. 
RNA-seq sequence reads were mapped to the reference 
human genome (hgl9) using Tophat [21] with Ensembl 
gene annotations (Ensembl genes r65). The CEU and 
CEU2 datasets were mapped with the single end or the 
paired end mode respectively. Only uniquely mapped 
reads were retained for downstream analysis. 

To search for sQTLs, we first identified all alternative 
splicing events and RNA-seq reads mapped to splice 
junctions using the MATS pipeline as described pre- 
viously [66]. We then focused our analysis on four types 
of alternative splicing events: skipped exon (SE), alterna- 
tive 5' splice site (A5SS), alternative 3' splice site (A3SS), 
and mutually exclusive exons (MXE). Using splice junc- 
tion reads, we can obtain a point estimate of the exon 
inclusion level ((//). Given that we have an observed 
number of splice junction reads for one isoform (y) and 
total splice junction read counts («), then i/f = y/n (see 
Figure SI in Additional file 1). We then filtered out 
exons with no or little change in exon inclusion level 
(|Ai/f| < 0.1) or few total junction read counts (median 
n <5) in the population, and obtained 18,267 AS events 
from CEU and 7,747 AS events from CEU2 for the 
downstream sQTL analysis. 

Genotype data 

The genotype data for the 41 individuals in the CEU 
dataset were taken from the latest HapMap3 release 
(#28). Of these 41 individuals, 23 were also genotyped 
in the 1000 Genomes project [9]. For SNPs uniquely 
reported by the 1000 Genomes project, we imputed the 
genotypes for individuals not in the 1000 Genomes pro- 
ject using Beagle [76]. We filtered out low frequency 
SNPs with MAF (minor allele frequency) <0.05. For 
each alternatively spliced exon, we tested cis SNPs 
within 200 kb upstream or downstream of the target 
exon splice sites when searching for sQTLs. For the 
CEU2 dataset, the 60 individuals were all included in 
the 1000 Genomes project. Fifty-eight of them were 
sequenced in low coverage and two were in high 



coverage. To avoid genotype calling bias, we only 
included the 58 low-coverage individuals with genotypes 
taken directly from the 1000 Genomes project data (10/ 
2010 release). The same MAF filtering was used as in 
the CEU dataset. 

Statistical models for sQTL analysis 

All statistical analyses were done in the R statistical 
environment [77]. We evaluated three different models 
for sQTL analysis: linear model (lm), generalized linear 
model (glm), and our proposed generalized linear mixed 
model (GLiMMPS). The model details were provided in 
Supplementary Methods in Additional file 1. Here we 
only briefly describe the GLIMMPS model. GLiMMPS 
is a hierarchical model that uses the reads information 
from both exon inclusion and skipping isoforms instead 
of only a point estimate of exon inclusion level (as 
in the lm model used in [16,17]) in sQTL analysis. 
Given the observed junction read counts as in Figure SI 
in Additional file 1 we assume that these junction reads 
supporting two alternative isoforms follow a binomial 
distribution: j>;|^; ~ Binomialfa, fa). To deal with the 
overdispersion in the generalized linear model, we model 
the extra variance of y/ as a random effect for each indivi- 
dual i in the regression model with random effects, 
iMj ~ N(0, o*) [22]. Let «y = <r U jZij, where z,j ~ N(0, 1), /}, 
denoting the fixed effect for SNP /, the second level of the 
model can be written as: i/f; = logit _1 (^o + fygij + er u; Zy). 
Thus the joint likelihood for [}, a uj is given by: 



Hp 



■'"-nft)/fi 



exp{(/3 0 + Pjgy + (T„jZ,j)} y 



[l + exp {(ft] + Pjgij + (TujZij)}] 



w N(Zij)dzji 



where N(-) is the standard normal density. Function 
glmer() from R package lme4 was used to fit the model, 
where Laplace approximation is used for the parameter 
estimations and a likelihood ratio test was used to 
obtain the P values for the fixed effect fij for each SNP 

For both the CEU and CEU2 datasets, using each of 
the statistical models (lm, glm, and GLiMMPS) men- 
tioned above, we carried out the analysis for each exon 
with SNPs within 200 kb of the exon. To estimate the 
false discovery rate, we used the same permutation 
approach as in [16] to obtain the null distribution of the 
P values. The details are in Supplementary Methods in 
Additional file 1. 

RT-PCR validation 

To validate the sQTLs found in the CEU datasets, we 
randomly selected 26 significant sQTL exons (FDR <0.1) 
for RT-PCR validation. We performed the validation 
experiments on an independent panel of 86 lymphoblas- 
toid cell lines from the HapMap3 project (Additional 
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file 3), which were purchased from the Coriell Institute 
for Medical Research, Camden, NJ, USA. Total RNA 
was extracted using TRIzol (Invitrogen, Carlsbad, CA, 
USA) and reverse transcribed by the High-Capacity 
cDNA Reverse Transcription Kit (Applied Biosystems, 
Foster City, CA, USA). Fluorescently labeled RT-PCR 
was carried out as described before [24]. Capillary elec- 
trophoresis (Georgia Genomics Facility, Athens, GA, 
USA) and 5% Urea TBE-PAGE were used for resolving 
PCR products. In capillary electrophoresis, band peak 
area was generated by GeneMapper 4.0 software 
(Applied Biosystems, Carlsbad, CA, USA). In 5% Urea 
PAGE, the signal was captured by Fujifilm FLA-7000 
(Fuji Photo Film Co. Ltd., Tokyo, Japan) and quantified 
using the ImageQuant TL7.0 software (General Electric 
Company, Waukesha, WI, USA). Final exon inclusion 
level was calculated as the peak area or band intensity 
of the exon inclusion band(s) divided by the total peak 
areas or band intensities of all bands. To test the asso- 
ciation of genotypes with the RT-PCR estimated exon 
inclusion levels, we used the most significant HapMap3 
sQTL SNP for each target exon. A linear regression on 
the estimated exon inclusion levels with the SNP geno- 
types of the SNP was used to calculate P values and 
those with P value <0.05 were called as validated. All 
RT-PCR primer sequences are listed in Additional file 4 
and individual exon inclusion levels are listed in Addi- 
tional file 5. 

Minigene analysis 

We used the hybrid construct pI-ll-H3 (provided by 
Dr. Russ P. Carstens, University of Pennsylvania, Phila- 
delphia, PA, USA) for our minigene splicing reporter 
assays. Genomic DNAs were extracted from LCLs using 
UltraClean™ Tissue&Cells DNA Isolation kit (MO BIO 
Laboratories, Carlsbad, CA, USA). The target exon and 
its flanking 350-500 bp intronic regions were amplified 
by PCR (see Additional file 6 for the primer sequences). 
In-Fusion™ Advantage PCR Cloning Kit (Clontech, 
Mountain View, CA, USA) or restriction enzyme diges- 
tion and ligation strategy were used to clone PCR pro- 
ducts into the vector. Site-directed mutagenesis was 
carried out following the manufacturer's instructions. 
The integrity of all constructs was confirmed by sequen- 
cing. To test minigene splicing, plasmids were transiently 
transfected into HEK293 cells. Fluorescently labeled RT- 
PCR was performed to evaluate the splicing impact of 
specific polymorphisms as described before [24]. 

GWAS signals 

We obtained 7,523 GWAS SNPs at genome-wide signifi- 
cance level of P value <10' 5 from the Catalog of Pub- 
lished Genome-Wide Association Studies (accessed 03/ 
30/2012) [37]. Using all the 1000 Genomes SNPs from 



the CEU population, we obtained all SNPs that are in 
high linkage disequilibrium with the GWAS SNPs (r 2 >0.8 
in the CEU population and within 200 kb window of the 
GWAS SNP). Because of the high SNP density and high 
recombination rate around the MHC region, we excluded 
genes from this region in this part of the analysis. We 
then identified sQTL signal SNPs overlapped with this 
expanded list of GWAS linked SNPs. 

Data access and source code availability 

The GLiMMPS model has been implemented and 
released in an easy to use package. The splice junction 
read counts, genotypes, and validation datasets, as well 
as the source code used for sQTL processing and analy- 
sis are provided at the companion website of this article 
[75]. 

Additional material 



Additional file 1: Supplementary Methods and Supplementary 
Figures S1-S9 

Additional file 2: Supplementary table SI. sQTL exons (FDR <0.1) and 
information of the most proximal sQTL SNPs. 

Additional file 3: Supplementary table S2 HapMap3 samples used for 
RT-PCR validation of sQTLs. 

Additional file 4: Supplementary table S3 Primers and RT-PCR results 
for validation of sQTLs. 

Additional file 5: Supplementary table S4 The individual exon 
inclusion levels for RT-PCR validation of sQTLs. 

Additional file 6: Supplementary table S5 Primers used for 
constructing minigene splicing reporters. 
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